## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
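The output above has the shape of `dim()`, `names()`, `str()` and `summary()` calls on the loaded data frame. The following is a minimal sketch of that inspection step, assuming the data were read with `read.csv()`. The toy table and the temporary file are stand-ins, not the real wine file. It also shows one plausible origin of the `X` column: a CSV written with row names has an empty first header, which `read.csv()` renames to `X`.

```r
# Toy stand-in for the wine table (values chosen for illustration only).
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2),
  alcohol       = c(9.4, 9.8, 9.8),
  quality       = c(5L, 5L, 6L)
)

# Writing with row names produces an unnamed first column in the file.
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = TRUE)

# read.csv() turns the empty header into "X".
wine <- read.csv(path)

dim(wine)      # number of rows and columns
names(wine)    # column names, starting with "X"
str(wine)      # type and first values of each column
summary(wine)  # min, quartiles, mean and max per column
```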
The distribution of quality looks roughly normal. Most wines are rated 5 or 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The distribution of fixed.acidity is somewhat right-skewed, with no extreme outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
At first I thought volatile.acidity was simply normally distributed, but with a smaller binwidth the histogram turned out to be bimodal.
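The effect of binwidth on what a histogram reveals can be sketched as follows. This assumes the ggplot2 package is installed. The data are synthetic (two normal clusters), and the two binwidth values are illustrative choices, not the ones used in the original plots.

```r
library(ggplot2)

set.seed(1)
# Synthetic bimodal sample: two normal clusters (not the wine data).
toy <- data.frame(x = c(rnorm(500, mean = 0.40, sd = 0.05),
                        rnorm(500, mean = 0.60, sd = 0.05)))

# Wide bins tend to blur the dip between the two clusters.
p_coarse <- ggplot(toy, aes(x = x)) + geom_histogram(binwidth = 0.2)

# Narrow bins leave room for the dip between the modes to show.
p_fine <- ggplot(toy, aes(x = x)) + geom_histogram(binwidth = 0.01)
```

Printing `p_coarse` and `p_fine` side by side is a quick way to see whether an apparent single peak survives a finer binwidth.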
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acid is right-skewed. I tried transforming the x axis with scale_x_sqrt or scale_x_log10, but neither gave a noticeably cleaner shape.
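One possible reason the log transform does not help here: the summary shows a minimum of 0 for citric.acid, and the logarithm of 0 is not finite. The sketch below uses a handful of toy values in the same range (not the real column) to compare the two transforms in base R.

```r
# Toy values spanning the citric.acid range (0 to 1); not the real data.
x <- c(0, 0, 0.04, 0.26, 0.56, 1)

sqrt(x)    # zeros stay at 0, so every value keeps a finite position
log10(x)   # zeros map to -Inf and cannot be placed on a log axis

# Share of values that a log10 axis would lose.
mean(!is.finite(log10(x)))
```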
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
To take a closer look, I focused on residual.sugar values from 1.0 to 4.0. The distribution is right-skewed and has many outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] 0.0470653
Chlorides are concentrated around 0.08, with a very small standard deviation of 0.047.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Since the original histogram of total sulfur dioxide is right-skewed, I used the scale_x_sqrt function to make the distribution easier to read.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
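A minimal sketch of a square-root x axis on a right-skewed variable, assuming the ggplot2 package is installed. The sample is synthetic (exponential draws with a mean near 46), and the bin count is an arbitrary choice for illustration.

```r
library(ggplot2)

set.seed(2)
# Synthetic right-skewed sample standing in for total.sulfur.dioxide.
toy <- data.frame(total.sulfur.dioxide = rexp(1000, rate = 1 / 46))

# Histogram on the original scale: long tail to the right.
p_raw <- ggplot(toy, aes(x = total.sulfur.dioxide)) +
  geom_histogram(bins = 30)

# Same histogram with a square-root x axis, which compresses the tail.
p_sqrt <- p_raw + scale_x_sqrt()
```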
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] 0.001887334
Density is approximately normally distributed. One thing to keep in mind is that its standard deviation is very small.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH is approximately normally distributed. Its median and mean are almost equal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol is smoothly right-skewed.
What is the structure of your dataset? There are 1599 wines with 12 variables. I dropped the X index column and replaced the integer quality column with a categorical data$quality column.
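The column clean-up described here might look like the sketch below in base R. The rows are toy values, and using `factor()` for the categorical column is an assumption about how the conversion was done.

```r
# Toy rows shaped like the wine table (values for illustration only).
data <- data.frame(
  X       = 1:4,
  alcohol = c(9.4, 9.8, 10.0, 10.5),
  quality = c(5L, 5L, 7L, 6L)
)

# Drop the row-index column.
data$X <- NULL

# Replace the integer score with a categorical version.
data$quality <- factor(data$quality)

str(data)
levels(data$quality)  # "5" "6" "7"
```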
What is/are the main feature(s) of interest in your dataset? From the description of the data set, I suspect volatile acidity, residual sugar, and chlorides, since these factors seem to affect the taste of wine directly. I'd like to determine which features are best for predicting the quality of a wine.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this? I square-root transformed the right-skewed total sulfur dioxide distribution. The transformed distribution is closer to the shape of a normal distribution.
According to the correlation matrix, quality seems to correlate with volatile.acidity, density, sulphates, and alcohol. I want to take a closer look at the scatter plots between them.
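A correlation matrix of this kind can be sketched with base R's `cor()`. The columns below are synthetic, with the relationships built in for illustration, so the numbers do not reproduce the wine results. Note that `cor()` needs numeric input, so quality is kept numeric here rather than as a factor.

```r
set.seed(3)
n <- 200
# Synthetic columns; the dependence of quality on the others is built in.
alcohol          <- rnorm(n, mean = 10.4, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
quality          <- 3 + 0.3 * alcohol - 1.4 * volatile.acidity + rnorm(n, sd = 0.7)
toy <- data.frame(alcohol, volatile.acidity, quality)

# Pairwise Pearson correlations between all numeric columns.
cm <- cor(toy)
round(cm, 2)

# Correlations with quality only, strongest first by absolute value.
q <- cm["quality", colnames(cm) != "quality"]
q[order(abs(q), decreasing = TRUE)]
```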
It looks like there is a negative correlation.
Standard deviation of density by quality
## data$quality: 3
## [1] 0.002001845
## --------------------------------------------------------
## data$quality: 4
## [1] 0.001575169
## --------------------------------------------------------
## data$quality: 5
## [1] 0.001588504
## --------------------------------------------------------
## data$quality: 6
## [1] 0.002000009
## --------------------------------------------------------
## data$quality: 7
## [1] 0.002175739
## --------------------------------------------------------
## data$quality: 8
## [1] 0.002378276
The density boxplots largely overlap across quality levels.
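Per-group output in this layout is what base R's `by()` prints. A small sketch on toy values follows. The `tapply()` variant is an addition for convenience, not something the original analysis is known to have used.

```r
# Toy data: a numeric measurement and a categorical quality score.
data <- data.frame(
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9962, 0.9955),
  quality = factor(c(5, 5, 5, 6, 6, 6))
)

# One standard deviation per quality level, printed one block per level.
by(data$density, data$quality, sd)

# The same numbers as a named array, easier to reuse in later code.
sd_by_quality <- tapply(data$density, data$quality, sd)
sd_by_quality
```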
library(ggplot2)
# Jittered points plus a boxplot of sulphates for each quality level
ggplot(data, aes(x = quality, y = sulphates)) +
  geom_jitter(alpha = 0.3) +
  geom_boxplot()
Standard deviation of sulphates by quality
## data$quality: 3
## [1] 0.12202
## --------------------------------------------------------
## data$quality: 4
## [1] 0.239391
## --------------------------------------------------------
## data$quality: 5
## [1] 0.1710623
## --------------------------------------------------------
## data$quality: 6
## [1] 0.1586495
## --------------------------------------------------------
## data$quality: 7
## [1] 0.1356389
## --------------------------------------------------------
## data$quality: 8
## [1] 0.1153795
The sulphates boxplots show many outliers.
The density and sulphates variables show a similar pattern in their graphs. Both change in value across the quality levels, but since their ranges are quite narrow, I am not sure the differences are statistically significant.
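One way such a doubt could be checked, which is not part of the original analysis, is a rank-based test across the quality groups. The sketch below uses `kruskal.test()` from base R's stats package on synthetic groups. The group means, spreads, and sizes are invented for illustration.

```r
set.seed(4)
# Synthetic groups: three quality levels with slightly shifted sulphates means.
toy <- data.frame(
  quality   = factor(rep(c(5, 6, 7), each = 50)),
  sulphates = c(rnorm(50, 0.62, 0.17),
                rnorm(50, 0.67, 0.16),
                rnorm(50, 0.74, 0.14))
)

# Kruskal-Wallis rank sum test: does not assume normality, which suits
# skewed variables with outliers.
res <- kruskal.test(sulphates ~ quality, data = toy)
res$p.value  # a small value suggests at least one group differs
```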
Number of wines by quality
## data$quality: 3
## [1] 10
## --------------------------------------------------------
## data$quality: 4
## [1] 53
## --------------------------------------------------------
## data$quality: 5
## [1] 681
## --------------------------------------------------------
## data$quality: 6
## [1] 638
## --------------------------------------------------------
## data$quality: 7
## [1] 199
## --------------------------------------------------------
## data$quality: 8
## [1] 18
Most wines are rated 5 or 6. Even though quality 5 has some alcohol outliers, alcohol and quality appear to be positively correlated.
The points are concentrated at low alcohol and relatively low volatile.acidity.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset? Alcohol and volatile.acidity correlate with quality.
On the other hand, density and sulphates take relatively similar values across quality levels.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)? Density and fixed.acidity show a correlation of 0.668. Alcohol and density show a correlation of -0.496.
What was the strongest relationship you found? Alcohol and volatile.acidity show relatively strong correlations with the quality of wine.
Multivariate Plots Section
You can see the color changing from top-left to bottom-right. In addition, the regression lines suggest that volatile.acidity is a more important factor than alcohol for predicting the quality of wine: many of the colored regression lines are nearly horizontal, which means each quality group sits at its own level of volatile.acidity. In this graph, the lower the volatile.acidity, the better the quality.
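A plot of this kind can be sketched as follows, assuming the ggplot2 package is installed. The axis assignment (alcohol on x, volatile.acidity on y) is an assumption inferred from the description, and the data are synthetic.

```r
library(ggplot2)

set.seed(5)
n <- 300
# Synthetic stand-in for the wine data; quality is derived from both predictors.
toy <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.2, 1.2)
)
score <- 3 + 0.3 * toy$alcohol - 1.4 * toy$volatile.acidity + rnorm(n, sd = 0.5)
toy$quality <- factor(round(score))

# Points colored by quality, plus one least-squares line per quality level.
p <- ggplot(toy, aes(x = alcohol, y = volatile.acidity, color = quality)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
```

Because `color = quality` is set in the top-level `aes()`, `geom_smooth()` fits a separate line for each quality level rather than one line for all points.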
Convert quality from factor back to integer in order to run a multiple regression.
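Converting a factor back to numbers has a well-known pitfall in R, sketched below on toy values: `as.integer()` applied directly to a factor returns the internal level codes, not the labels.

```r
# A factor of quality scores (toy values).
q <- factor(c(5, 5, 7, 3, 8))

# Pitfall: this returns the internal level codes, not the scores.
as.integer(q)                  # 2 2 3 1 4

# Going through the labels recovers the original numbers.
as.integer(as.character(q))    # 5 5 7 3 8
```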
##
## Call:
## lm(formula = data$quality ~ data$volatile.acidity + data$alcohol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59342 -0.40416 -0.07426 0.46539 2.25809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.09547 0.18450 16.78 <2e-16 ***
## data$volatile.acidity -1.38364 0.09527 -14.52 <2e-16 ***
## data$alcohol 0.31381 0.01601 19.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared: 0.317, Adjusted R-squared: 0.3161
## F-statistic: 370.4 on 2 and 1596 DF, p-value: < 2.2e-16
The p-values of both independent variables are small (<2e-16). The adjusted R-squared is 0.3161, so about 32% of the variance in quality is explained by this model.
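To make the fitted model concrete, the sketch below plugs the reported coefficients into the regression equation for a hypothetical wine and recomputes the adjusted R-squared from the reported R-squared. The input values 0.5 and 10 are illustrative choices, not rows from the data.

```r
# Coefficients copied from the regression output.
b0   <- 3.09547   # intercept
b_va <- -1.38364  # volatile.acidity
b_al <- 0.31381   # alcohol

# Predicted quality for a hypothetical wine (inputs chosen for illustration).
va <- 0.5
al <- 10
pred <- b0 + b_va * va + b_al * al
pred  # 5.54175

# Adjusted R-squared from R-squared, n = 1599 observations, p = 2 predictors.
r2 <- 0.317
n  <- 1599
p  <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.3161
```

With 1599 observations and only two predictors, the adjustment is tiny, which is why the two R-squared values in the output are nearly identical.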
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
By mapping the quality variable to color, I could observe how volatile.acidity and alcohol correlate with quality.
Were there any interesting or surprising interactions between features? There is an interesting correlation, but it is not as strong as in the diamonds example.
First of all, from the correlation matrix I chose two variables, volatile.acidity and alcohol, since each seemed to correlate with quality in its scatterplot. So I decided to take a closer look at those variables using colored boxplots. These boxplots are Plot One and Plot Two.
Although both graphs show good results, the boxplots suggest that volatile.acidity is the better variable for predicting quality.
Therefore, I wanted a graph that includes all three variables at once. I made a scatterplot colored by quality and added a regression line for each color.
Surprisingly, the regression lines make it clear that volatile.acidity correlates more strongly with quality than alcohol does, because most regression lines are drawn horizontally.
Description Three: The color changes from top-left to bottom-right. Since there is a pattern in this graph, the value of quality can be predicted.
From my research, I confirmed that quality correlates with some variables. Plot Three shows a clear pattern. Volatile.acidity has a stronger correlation with quality than alcohol. In addition, the multiple regression based on Plot Three has an adjusted R-squared of 0.3161.
To be honest, before drawing the regression lines in Plot Three, I did not think I was doing well, since the correlations between quality and the other variables looked really weak. However, the regression lines in the scatterplot changed my mind, since they clearly revealed a pattern. Through this course, I learned how to look at data from various points of view. In the future, I would like to learn further ways of observing data so that I do not miss important points.